import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
score = pd.read_csv(r"C:\Users\Teni\Desktop\Git-Github\Datasets\Linear Regression\CGPA & SAT score.csv")
score.head()
|    | SAT  | GPA  |
|----|------|------|
| 0  | 1714 | 2.40 |
| 1  | 1664 | 2.52 |
| 2  | 1760 | 2.54 |
| 3  | 1685 | 2.74 |
| 4  | 1693 | 2.83 |
Visually assess whether there is a linear relationship between SAT scores and GPAs
sns.scatterplot(data=score, x='GPA', y='SAT');
sns.regplot(data=score, x='GPA', y='SAT');
Define the X and y variables
X = score['GPA']
y = score['SAT']
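The slope (245.21763914) and intercept (1028.64068603) used in the predictions below come from a simple linear regression fitted to this data; the fitting step itself is not shown in this section. One way to obtain such coefficients is `np.polyfit`. The sketch below uses a small hypothetical subset of the GPA/SAT pairs as a stand-in for the full CSV:

```python
import numpy as np

# Hypothetical stand-in for the GPA and SAT columns from the CSV
gpa = np.array([2.40, 2.52, 2.54, 2.74, 2.83, 3.00, 3.40, 3.71, 3.81])
sat = np.array([1714, 1664, 1760, 1685, 1693, 1764, 1925, 1936, 2050])

# Fit SAT = slope * GPA + intercept (degree-1 polynomial least squares)
slope, intercept = np.polyfit(gpa, sat, 1)

# Predict the SAT score for a GPA of 3.4 from the fitted line
pred = slope * 3.4 + intercept
print(slope, intercept, pred)
```

On the full 84-row dataset the same call would yield the exact coefficients used below; here the subset only illustrates the mechanics.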
Suppose someone had a GPA of 3.4; what is the predicted SAT score?
score
|     | SAT  | GPA  |
|-----|------|------|
| 0   | 1714 | 2.40 |
| 1   | 1664 | 2.52 |
| 2   | 1760 | 2.54 |
| 3   | 1685 | 2.74 |
| 4   | 1693 | 2.83 |
| ... | ...  | ...  |
| 79  | 1936 | 3.71 |
| 80  | 1810 | 3.71 |
| 81  | 1987 | 3.73 |
| 82  | 1962 | 3.76 |
| 83  | 2050 | 3.81 |

84 rows × 2 columns
# GPA = np.linspace(0, 10, 2)
GPA = 3.4
# Slope and intercept of the fitted regression line: SAT = 245.2176*GPA + 1028.6407
pred_SAT = 245.21763914*GPA + 1028.64068603
pred_SAT
1862.3806591060002
This means the predicted SAT score is approximately 1862.
As the rows below show, this is close to the actual SAT scores of students with a GPA of 3.4 in the original data
score[score['GPA'] == 3.4]
|    | SAT  | GPA |
|----|------|-----|
| 45 | 1925 | 3.4 |
| 46 | 1824 | 3.4 |
| 47 | 1956 | 3.4 |
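The gap between the prediction and these actual scores can be checked numerically. A minimal sketch, using the three SAT values at GPA 3.4 from the table above:

```python
import numpy as np

# Actual SAT scores of students with GPA == 3.4 (from the data)
actual = np.array([1925, 1824, 1956])

# Prediction from the fitted line for GPA = 3.4
predicted = 245.21763914 * 3.4 + 1028.64068603

# Residuals: actual minus predicted
residuals = actual - predicted
print(residuals)
print(residuals.mean())
```

The mean residual is roughly 39 SAT points, small relative to scores near 1900.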
Now suppose someone had a GPA of 2.91; what is the predicted SAT score?
GPA = 2.91
pred_SAT = 245.21763914*GPA + 1028.64068603
pred_SAT
1742.2240159274002
score[(score['GPA'] <= 3.0) & (score['GPA']>= 2.92)]
|   | SAT  | GPA |
|---|------|-----|
| 6 | 1764 | 3.0 |
| 7 | 1764 | 3.0 |
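Since no row has a GPA of exactly 2.91, the filter above picks a nearby range by hand. An alternative is to sort rows by their distance from the query GPA. A sketch, with a small hypothetical stand-in for the `score` DataFrame:

```python
import pandas as pd

# Hypothetical stand-in for a few rows of the score DataFrame
score = pd.DataFrame({'SAT': [1764, 1764, 1925, 1824],
                      'GPA': [3.00, 3.00, 3.40, 3.40]})

# Rows whose GPA is closest to the query value
query = 2.91
closest = score.iloc[(score['GPA'] - query).abs().argsort()[:2]]
print(closest)
```

This avoids hand-tuning the filter bounds for each query value.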
Our predicted SAT score is not far from the real labels in the data.
The model may not be spot on in comparison, but the residual difference is minimal.